31 research outputs found

    Perplexity-free Parametric t-SNE

    Full text link
The t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm is a ubiquitously employed dimensionality reduction (DR) method. Its non-parametric nature and impressive efficacy motivated its parametric extension. It remains, however, tied to a user-defined perplexity parameter, restricting its DR quality compared to recently developed multi-scale, perplexity-free approaches. This paper hence proposes a multi-scale parametric t-SNE scheme, relieved of perplexity tuning, with a deep neural network implementing the mapping. It produces reliable embeddings with out-of-sample extensions, competitive with the best perplexity adjustments in terms of neighborhood preservation on multiple data sets.
    Comment: ESANN 2020 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Online event, 2-4 October 2020, i6doc.com publ., ISBN 978-2-87587-074-2. Available from http://www.i6doc.com/en
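    A minimal sketch of the multi-scale idea behind such perplexity-free schemes: single-perplexity Gaussian affinities are averaged over exponentially spaced perplexities, so no single scale has to be chosen. The deep-network mapping that makes the paper's scheme parametric is omitted here, and all function names are illustrative, not the authors' code.

```python
# Sketch of multi-scale HD affinities: average single-perplexity Gaussian
# similarities over exponentially spaced perplexities, removing the need to
# pick one perplexity. The parametric (neural network) mapping of the paper
# is not reproduced here.
import numpy as np

def gaussian_affinities(D2, perplexity, tol=1e-5, n_iter=50):
    """Row-wise Gaussian similarities, with each row's bandwidth
    binary-searched so the row entropy matches log(perplexity)."""
    n = D2.shape[0]
    P = np.zeros_like(D2)
    target = np.log(perplexity)
    for i in range(n):
        lo, hi, beta = 0.0, np.inf, 1.0
        d = np.delete(D2[i], i)
        for _ in range(n_iter):
            p = np.exp(-beta * d)
            s = p.sum()
            H = np.log(s) + beta * (d * p).sum() / s  # Shannon entropy of row
            if abs(H - target) < tol:
                break
            if H > target:            # too flat: increase precision beta
                lo = beta
                beta = beta * 2 if np.isinf(hi) else (lo + hi) / 2
            else:                     # too peaked: decrease beta
                hi = beta
                beta = (lo + hi) / 2
        P[i, np.arange(n) != i] = p / s
    return P

def multiscale_affinities(X):
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    n = X.shape[0]
    scales = [2 ** h for h in range(1, int(np.log2(n / 4)) + 1)]
    P = np.mean([gaussian_affinities(D2, K) for K in scales], axis=0)
    return (P + P.T) / (2 * n)        # symmetrised joint probabilities

X = np.random.rand(100, 10)
P = multiscale_affinities(X)          # perplexity-free input similarities
```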

    Tuning Database-Friendly Random Projection Matrices for Improved Distance Preservation on Specific Data

    Get PDF
Random Projection is one of the most popular and successful dimensionality reduction algorithms for large volumes of data. However, given its stochastic nature, different initializations of the projection matrix can lead to very different levels of performance. This paper presents a guided random search algorithm to mitigate this problem. The proposed method uses a small number of training data samples to iteratively adjust a projection matrix, improving its performance on similarly distributed data. Experimental results show that projection matrices generated with the proposed method result in a better preservation of distances between data samples. Conveniently, this is achieved while preserving the database-friendliness of the projection matrix, as it remains sparse and comprised exclusively of integers after being tuned with our algorithm. Moreover, running the proposed algorithm on a consumer-grade CPU requires only a few seconds.
    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. Open access publication funded by the Consortium of University Libraries of Castilla y León (BUCLE), under the Operational Programme 2014ES16RFOP009 FEDER 2014-2020 of Castilla y León, Action 20007-CL - Support for the BUCLE Consortium.
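    A hedged sketch of the general idea, not the paper's exact procedure: start from a database-friendly sparse matrix with integer entries (an Achlioptas-style {-1, 0, +1} distribution is assumed here), then greedily keep random integer-valued mutations that reduce distance distortion on a small training sample.

```python
# Hedged sketch: tune a sparse {-1, 0, +1} random projection matrix by
# guided random search, keeping mutations that reduce pairwise-distance
# distortion on a training sample. The paper's exact search may differ.
import numpy as np

rng = np.random.default_rng(0)

def achlioptas_matrix(d, k):
    """Database-friendly sparse matrix: entries -1/0/+1 w.p. 1/6, 2/3, 1/6."""
    return rng.choice([-1, 0, 1], size=(d, k), p=[1/6, 2/3, 1/6])

def distortion(X, R):
    """Mean absolute error between original and projected pairwise
    distances, relative to the mean original distance."""
    scale = np.sqrt(3.0 / R.shape[1])        # Achlioptas scaling factor
    D_hd = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    Y = X @ R * scale
    D_ld = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    mask = D_hd > 0
    return np.abs(D_ld[mask] - D_hd[mask]).mean() / D_hd[mask].mean()

def tune(X_train, R, n_steps=2000, n_flips=3):
    best = distortion(X_train, R)
    for _ in range(n_steps):
        R_new = R.copy()
        idx = rng.integers(0, R.size, size=n_flips)
        R_new.flat[idx] = rng.choice([-1, 0, 1], size=n_flips)  # stays integer
        err = distortion(X_train, R_new)
        if err < best:                       # greedy acceptance rule
            R, best = R_new, err
    return R, best

X = rng.normal(size=(50, 200))               # small training sample
R_tuned, err = tune(X, achlioptas_matrix(200, 20))
```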

    Improving Individual Predictions using Social Networks Assortativity

    Get PDF
Social networks are known to be assortative with respect to many attributes, such as age, weight, wealth, level of education, ethnicity and gender. This can be explained by social influence and homophily. Independently of its origin, this assortativity gives us information about each node given its neighbors. Assortativity can thus be used to improve individual predictions in a broad range of situations where data are missing or inaccurate. This paper presents a general framework based on probabilistic graphical models to exploit social network structures for improving individual predictions of node attributes. Using this framework, we quantify the assortativity range leading to an accuracy gain in several situations. We finally show how specific characteristics of the network can improve performance further. For instance, the gender assortativity in real-world mobile phone data changes significantly according to some communication attributes. In this case, individual predictions with 75% accuracy are improved by up to 3%.
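    A naive-Bayes-style simplification of the framework, for a binary attribute: each labelled neighbour reweights the node's prior by the assortativity level a = P(two linked nodes share the attribute). The numbers are illustrative, and the paper's probabilistic graphical models are richer than this.

```python
# Hedged sketch: improve a node-attribute prediction using neighbour labels
# and an assortativity level `a` = P(two linked nodes share the attribute),
# treating neighbours as conditionally independent (naive-Bayes style).
import numpy as np

def posterior(prior, neighbour_labels, a, n_classes=2):
    """Combine a prior over classes with labelled neighbours, assuming each
    neighbour independently matches the node's class with probability a."""
    p = np.array(prior, dtype=float)
    for y in neighbour_labels:
        like = np.full(n_classes, (1 - a) / (n_classes - 1))
        like[y] = a
        p *= like
    return p / p.sum()

# A node whose individual classifier says P(class 1) = 0.55, with three
# neighbours labelled 1 and one labelled 0, in a network with a = 0.7:
print(posterior([0.45, 0.55], [1, 1, 1, 0], a=0.7))
# -> posterior mass on class 1 rises well above 0.55: the network helps.
```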

    Robust and fast neighbor embedding algorithms

    No full text
In numerous machine learning (ML) settings, complex data mining tasks necessitate user interaction and cannot be completely automated. This interaction can take place during an exploratory phase, in which data visualization helps determine and refine the application needs. Since most ML databases are nowadays high-dimensional (HD), visualizing them entails approaches from nonlinear dimensionality reduction (DR). This field aims at creating meaningful low-dimensional (LD) versions of HD data. The advent of neighbor embedding (NE) techniques markedly improved state-of-the-art DR performance. Nevertheless, the data exploration requirements of current ML applications involve generalizing modern NE algorithms in several respects. Namely, they should robustly handle unconventional data types, such as the incomplete databases that are omnipresent in data analysis. They must also be fast enough to process the very large data sets that are now ubiquitous. This thesis contributes to both aspects.
    Regarding the ability to deal with incomplete data sets, common missing-data imputation techniques are not suited to nonlinear DR, as they at best enable applying a DR scheme to the expected database. Since NE approaches are nonlinear, this differs from minimizing their expected cost function. The thesis addresses this limitation by proposing a general methodology to compute the LD embedding minimizing the expectation of the cost function, thanks to the multiple imputation framework.
    As to the development of fast NE schemes, multi-scale techniques are of great interest among NE methods as they account for the global HD organization to define the LD space, delivering outstanding DR quality. Their quadratic time complexity in the number of data samples, however, prevents tackling large-scale databases. The thesis addresses this difficulty by presenting fast multi-scale NE algorithms which account for the dense nature of the multi-scale similarities, providing high-quality embeddings of very big data sets. The robust and fast NE algorithms designed in this thesis hence open the path to enhanced HD data exploration in ML through visualization.
    (FSA - Sciences de l'ingénieur) -- UCL, 202
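    A toy sketch of the expected-cost idea from the thesis, with a plain MDS stress standing in for a nonlinear NE cost and a deliberately crude imputation model: a single LD embedding is optimized against the average cost over several imputed versions of the data, rather than against one mean-imputed data set.

```python
# Hedged sketch of the expected-cost idea: instead of embedding one
# mean-imputed data set, draw M imputations and optimise a single 2-D
# embedding against the *average* of the per-imputation costs. A simple
# MDS stress stands in for the nonlinear NE cost used in the thesis.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
mask = rng.random(X.shape) < 0.1             # 10% missing entries
col_mu, col_sd = X.mean(0), X.std(0)

# Crude imputation model: column mean plus Gaussian noise. A real analysis
# would use a proper multiple-imputation method instead.
M = 10
imputations = []
for _ in range(M):
    Xi = X.copy()
    noise = rng.normal(col_mu[None, :], col_sd[None, :], size=X.shape)
    Xi[mask] = noise[mask]
    imputations.append(Xi)

def pdist(A):
    return np.linalg.norm(A[:, None] - A[None, :], axis=-1)

D_list = [pdist(Xi) for Xi in imputations]
Y = rng.normal(size=(60, 2)) * 1e-2          # single shared LD embedding

for _ in range(300):                         # gradient descent on mean stress
    D_ld = pdist(Y) + np.eye(60)             # eye avoids division by zero
    grad = np.zeros_like(Y)
    for D_hd in D_list:                      # expectation over imputations
        W = (D_ld - D_hd) / D_ld
        np.fill_diagonal(W, 0.0)
        grad += 2 * (W.sum(1)[:, None] * Y - W @ Y) / M
    Y -= 0.01 * grad
```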

    fmsne: fast multi-scale neighbour embedding in R

    No full text
Dimensionality reduction (DR) has been a workhorse of large-scale, multivariate omics data analysis from the early days. Since the advent of single-cell RNA sequencing, non-linear approaches have taken the front stage, with t-distributed stochastic neighbour embedding (t-SNE) [1,2] being one of, if not the, main player. Packages such as `Rtsne` [3] and `scater` [4] have made it easy to apply t-SNE in R/Bioconductor workflows. One sticking point with t-SNE is the single perplexity parameter, which controls the number of nearest high-dimensional (HD) neighbours that are taken into account when constructing the low-dimensional (LD) embedding: small (resp. large) values only enable preserving small (resp. large) neighbourhoods from HD to LD during DR, impairing the reproduction of large (resp. small) neighbourhoods. It is thus a key parameter, especially if the LD embedding is used for interpretation, which is often the case in omics-based applications. Multi-scale neighbour embedding [5] is an extension to single-scale approaches such as t-SNE that exempts users from having to set a single perplexity (scale) arbitrarily. Multi-scale approaches maximise the LD embedding quality at all scales, preserving both local and global HD neighbourhoods [6]. They have been shown to better capture the structure of data and to significantly improve DR quality [7]. Here, we present `fmsne` (https://github.com/lgatto/fmsne), an R package that relies on the `basilisk` package [8] to provide a Bioconductor-friendly interface to fast multi-scale methods implemented in Python. `fmsne` implements fast multi-scale functions such as `runFMSTSNE()` and `plotFMSTSNE()`, based on scater's `scater::run*()` and `scater::plot*()` interface [4]. It also exposes the `drQuality()` function to assess DR quality using rank-based criteria [7]. Finally, we illustrate fast multi-scale methods on various single-cell datasets.
    [1] van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. _Journal of Machine Learning Research_, 9(Nov), 2579-2605.
    [2] van der Maaten, L. (2014). Accelerating t-SNE using tree-based algorithms. _Journal of Machine Learning Research_, 15(1), 3221-3245.
    [3] Krijthe, J. H. (2015). Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation. URL: https://github.com/jkrijthe/Rtsne
    [4] McCarthy, D. J., Campbell, K. R., Lun, A. T. L., & Wills, Q. F. (2017). Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R. _Bioinformatics_, 33, 1179-1186. doi:10.1093/bioinformatics/btw777
    [5] de Bodt, C., Mulders, D., Verleysen, M., & Lee, J. A. (2020). Fast Multiscale Neighbor Embedding. _IEEE Transactions on Neural Networks and Learning Systems_. doi:10.1109/TNNLS.2020.3042807
    [6] Lee, J. A., Peluffo-Ordóñez, D. H., & Verleysen, M. (2015). Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure. _Neurocomputing_, 169, 246-261.
    [7] Lee, J. A., & Verleysen, M. (2009). Quality assessment of dimensionality reduction: Rank-based criteria. _Neurocomputing_, 72(7-9), 1431-1443.
    [8] Lun, A. T. L. (2022). basilisk: a Bioconductor package for managing Python environments. _Journal of Open Source Software_, 7, 4742. doi:10.21105/joss.04742
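    A hedged sketch, in Python rather than R, of the rank-based quality criteria [7] that a function like `drQuality()` assesses: Q_NX(K) is the average fraction of the K nearest HD neighbours preserved among the K nearest LD neighbours, and R_NX(K) rescales it so that a random embedding scores 0.

```python
# Hedged sketch of rank-based DR quality criteria [7]: Q_NX(K) counts
# preserved K-ary neighbourhoods between HD and LD spaces; R_NX(K)
# rescales it against the expected score of a random embedding.
import numpy as np

def ranks(A):
    D = np.linalg.norm(A[:, None] - A[None, :], axis=-1)
    return np.argsort(np.argsort(D, axis=1), axis=1)  # rank 0 = point itself

def rnx_curve(X, Y):
    n = len(X)
    rX, rY = ranks(X), ranks(Y)
    Ks = np.arange(1, n - 1)
    qnx = np.array([(np.sum((rX <= K) & (rY <= K), axis=1) - 1).mean() / K
                    for K in Ks])                     # Q_NX(K)
    return ((n - 1) * qnx - Ks) / (n - 1 - Ks)        # R_NX(K)

X = np.random.rand(80, 10)
Y = X[:, :2]                                          # a trivial "embedding"
print(rnx_curve(X, Y)[:5])                            # quality at small scales
```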

    Don't skip the skips: autoencoder skip connections improve latent representation discrepancy for anomaly detection

    No full text
Reconstruction-based anomaly detection typically relies on the reconstruction of a defect-free output from an input image. Such a reconstruction can be obtained by training an autoencoder to reconstruct clean images from inputs corrupted with a synthetic defect. Previous works have shown that adopting an autoencoder with skip connections improves reconstruction sharpness. However, it remains unclear how skip connections affect the latent representations learned during training. Here, we compare the internal representations of autoencoders with and without skip connections. Experiments on the MVTec AD dataset reveal that skip connections enable the autoencoder latent representations to intrinsically discriminate between clean and defective images.
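    A minimal PyTorch sketch of the setup, with illustrative layer sizes and a toy synthetic defect: one encoder feature map is concatenated into the decoder (U-Net style), so the reconstruction can stay sharp while the latent code remains available for the kind of comparison the paper performs.

```python
# Minimal sketch (PyTorch) of an autoencoder with one skip connection:
# an encoder feature map is concatenated into the decoder, U-Net style.
# Layer sizes and the synthetic-corruption step are illustrative only.
import torch
import torch.nn as nn

class SkipAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1),
                                  nn.ReLU())                    # 64 -> 32
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                                  nn.ReLU())                    # 32 -> 16
        self.dec1 = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # skip: the decoder sees its own features *and* enc1's feature map
        self.dec2 = nn.ConvTranspose2d(32 + 32, 3, 4, stride=2, padding=1)

    def forward(self, x):
        h1 = self.enc1(x)
        z = self.enc2(h1)                 # latent representation
        d = self.dec1(z)
        d = torch.cat([d, h1], dim=1)     # the skip connection
        return torch.sigmoid(self.dec2(d)), z

model = SkipAE()
clean = torch.rand(4, 3, 64, 64)
corrupted = clean.clone()
corrupted[:, :, 20:30, 20:30] = torch.rand(4, 3, 10, 10)  # synthetic defect
recon, z = model(corrupted)
loss = nn.functional.mse_loss(recon, clean)  # train to remove the defect
# Comparing z for clean vs. defective inputs is what the paper analyses.
```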

    Fine-tuning is not (always) overfitting artifacts

    No full text
Since their release, transformers, and in particular fine-tuned transformers, have been widely used for text-related classification tasks. However, only a few studies try to understand how fine-tuning actually works, and existing alternatives, such as feature-based transformers, are often overlooked. In this work, we study a French transformer model, CamemBERT, to compare the fine-tuned and feature-based approaches in terms of their performance, interpretability and embedding space. We observe that while fine-tuning has a limited impact on performance in our case study, it significantly affects the interpretability (by better isolating words that are intuitively connected to the classification task) and the embedding space (by summarizing the majority of the relevant information into fewer dimensions) of the results. We conclude by highlighting open questions regarding the generalization potential of fine-tuned embeddings.
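    A hedged sketch of the two approaches compared here, using the public `camembert-base` checkpoint from Hugging Face; the texts, labels and downstream classifier are placeholders.

```python
# Hedged sketch of the feature-based vs. fine-tuned approaches, using the
# public `camembert-base` checkpoint. Data and classifier are placeholders.
import torch
from transformers import (AutoTokenizer, AutoModel,
                          AutoModelForSequenceClassification)

tok = AutoTokenizer.from_pretrained("camembert-base")
texts = ["un exemple positif", "un exemple négatif"]
batch = tok(texts, padding=True, return_tensors="pt")

# 1) Feature-based: freeze the transformer and use its embeddings as
# features for an external classifier (e.g. logistic regression).
encoder = AutoModel.from_pretrained("camembert-base")
with torch.no_grad():
    feats = encoder(**batch).last_hidden_state[:, 0]  # <s> token embedding
# -> feed `feats.numpy()` to sklearn's LogisticRegression, for instance.

# 2) Fine-tuning: add a classification head and update *all* weights.
clf = AutoModelForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=2)
labels = torch.tensor([1, 0])
out = clf(**batch, labels=labels)
out.loss.backward()   # gradients flow through the whole transformer
```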

    SQuadMDS: a lean Stochastic Quartet MDS improving global structure preservation in neighbor embedding like t-SNE and UMAP

    No full text
Multidimensional scaling is a process that aims to embed high-dimensional data into a lower-dimensional space; it is often used for data visualisation. Common multidimensional scaling algorithms tend to have high computational complexities, making them inapplicable to large data sets. This work introduces a stochastic, force-directed approach to multidimensional scaling with a time and space complexity of O(N) for N data points. The method can be combined with force-directed layouts from the neighbour embedding family, such as t-SNE, to produce embeddings that preserve both the global and the local structures of the data. Experiments assess the quality of the embeddings produced by the standalone version and its hybrid extension, both quantitatively and qualitatively, showing competitive results that outperform state-of-the-art approaches. Code is available at https://github.com/PierreLambert3/SQuaD-MDS-and-FItSNE-hybrid
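    A hedged sketch of the stochastic quartet idea: each sweep partitions the points into random quartets (hence the O(N) cost per sweep), normalises distances within each quartet, and nudges the LD points towards the normalised HD distances. The gradient below is deliberately simplified (the quartet's normalisation constant is treated as fixed), unlike the paper's exact expression.

```python
# Hedged sketch of stochastic quartet MDS: partition points into random
# quartets, normalise distances *within* each quartet, and move LD points
# to reduce the mismatch with the normalised HD distances.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))        # HD data
Y = rng.normal(size=(400, 2)) * 0.1   # LD embedding, updated in place
lr = 0.05

for _ in range(500):
    perm = rng.permutation(len(X))
    for q in perm.reshape(-1, 4):                    # O(N) quartets per sweep
        pairs = list(combinations(range(4), 2))
        d_hd = np.array([np.linalg.norm(X[q[a]] - X[q[b]]) for a, b in pairs])
        d_ld = np.array([np.linalg.norm(Y[q[a]] - Y[q[b]]) for a, b in pairs])
        d_hd /= d_hd.sum()                           # relative HD distances
        s_ld = d_ld.sum()
        for (a, b), dh, dl in zip(pairs, d_hd, d_ld):
            err = dl / s_ld - dh                     # relative-distance error
            g = err * (Y[q[a]] - Y[q[b]]) / (dl * s_ld + 1e-12)
            Y[q[a]] -= lr * g                        # opposite nudges
            Y[q[b]] += lr * g
```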

    Proximities in dimensionality reduction

    No full text
Dimensionality reduction aims at representing high-dimensional data in a lower-dimensional representation, while preserving their structure (clusters, outliers, manifold). Dimensionality reduction can be used for exploratory data visualization, data compression, or as a preprocessing step for another analysis in order to alleviate the curse of dimensionality. Data structure is usually quantified with indicators, like covariance between variables, or pairwise proximity relationships, like scalar products, distances, similarities, or neighbourhoods. One objective of this chapter is to provide an overview of some classical and more recent methods of dimensionality reduction, to shed some light on them from the perspective of analyzing proximities, and to illustrate them with multivariate data of the kind typically encountered in social sciences. Complementary aspects like quality assessment and alternative metrics are briefly developed.
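    The proximity quantities the chapter builds on can be made concrete on toy data; the snippet below computes pairwise distances, the centred scalar products used by classical MDS, Gaussian similarities, and k-nearest neighbourhoods.

```python
# Tiny illustration of the proximity quantities discussed in the chapter:
# pairwise distances, centred scalar products (the Gram matrix of classical
# MDS), Gaussian similarities, and k-nearest neighbourhoods.
import numpy as np

X = np.random.rand(10, 4)
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # distances

n = len(X)
J = np.eye(n) - np.ones((n, n)) / n                    # centring matrix
G = -0.5 * J @ (D ** 2) @ J                            # centred scalar products

sigma = np.median(D[D > 0])
S = np.exp(-D ** 2 / (2 * sigma ** 2))                 # similarities

k = 3
NN = np.argsort(D, axis=1)[:, 1:k + 1]                 # k-NN neighbourhoods
```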

    Nesterov momentum and gradient normalization to improve t-SNE convergence and neighborhood preservation, without early exaggeration

    No full text
Student t-distributed stochastic neighbor embedding (t-SNE) finds low-dimensional data representations that allow visual exploration of data sets. t-SNE minimizes a cost function with a custom two-phase gradient descent. The first phase, called early exaggeration, involves a hyper-parameter whose value can be tricky and time-consuming to set. This paper proposes another way to optimize the cost function, without early exaggeration. Empirical evaluation shows that the proposed optimization method converges faster and yields competitive results in terms of neighborhood preservation.
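    A compact sketch of the optimizer the abstract describes: a Nesterov look-ahead gradient, normalized to unit norm, and no early-exaggeration phase. The affinity computation is simplified to a single fixed bandwidth to keep the sketch short; a faithful implementation would binary-search per-point bandwidths for a target perplexity.

```python
# Hedged sketch: t-SNE optimized with Nesterov momentum and a gradient
# normalized to unit norm, with *no* early-exaggeration phase. HD affinities
# use a single fixed bandwidth as a simplification.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
n = len(X)

D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
P = np.exp(-D2 / np.median(D2))               # simplified HD affinities
np.fill_diagonal(P, 0.0)
P /= P.sum(1, keepdims=True)
P = (P + P.T) / (2 * n)                       # symmetrized joint probabilities

def tsne_grad(Y):
    d2 = ((Y[:, None] - Y[None, :]) ** 2).sum(-1)
    W = 1.0 / (1.0 + d2)                      # Student-t kernel
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    PQ = (P - Q) * W
    return 4 * ((np.diag(PQ.sum(1)) - PQ) @ Y)

Y = rng.normal(size=(n, 2)) * 1e-4
v = np.zeros_like(Y)
mu, eta = 0.9, 1.0
for _ in range(750):
    g = tsne_grad(Y + mu * v)                 # Nesterov look-ahead
    g /= np.linalg.norm(g) + 1e-12            # gradient normalization
    v = mu * v - eta * g
    Y += v
```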